Describing patterns, answering questions

Ben Whalley, Paul Sharpe, Sonja Heintz

Overview

Techniques covered

  • We collect data to answer questions
  • The first step is to describe and visualise patterns in our data
  • Common descriptions include measures of:
    • central tendency and spread of continuous variables (and of differences in these values)
    • the frequency of categorical responses
    • the relationships between variables (correlation)
  • After that, we can quantify our confidence (see next workshop)
  • Visualisation is an undervalued method for understanding data

An important task for researchers is to answer questions using data. We can often divide this activity into:

  1. Describing patterns in the data
  2. Quantifying how sure we are about those patterns

This session is all about describing and visualising patterns in the data to answer research questions.

In this session we will cover four techniques psychologists use to answer research questions.

  1. summarising numeric variables by central tendency and spread
  2. calculating the frequency of categorical responses
  3. calculating differences between scores or groups
  4. describing the relationship between two variables

In the past psychologists have often neglected the first part (spotting and describing patterns) and jumped straight to the second — for example, they have been very keen to run hypothesis tests and calculate p values…

More recently, researchers have placed much more emphasis on describing and visualising the data — to really get a feel for the patterns they see — before trying to quantify the evidence it provides or make inferences from it.

We have already seen how to implement some of these techniques in R (e.g. using summarise()) and with ggplot.

However, more important than any specific technique in R is the idea that we collect data to answer questions (not just for its own sake!).

Central tendency and spread

  • the central tendency of the data describes the “middle” of a set of values
  • the mean and median are the most common measures
  • measures of spread or distribution of the data show where most of the values fall (i.e. which range of values is most likely)
  • common measures are the standard deviation or interquartile range
  • we have already seen how to calculate these statistics using summarise() and group_by()
# this is a recap of earlier material

# calculate an average
# typical weight at baseline in the FIT trial
funimagery %>%
  summarise(mean(kg1))
  mean(kg1)
1  90.70536
# boxplots show the interquartile range (IQR) as the height of the box.
# The IQR is the range which includes 50% of the data points
funimagery %>%
  ggplot(aes(intervention, kg1)) +
  geom_boxplot() +
  scale_y_continuous(n.breaks = 10) # this extra line just adds more marks on the y-axis

If we have a single continuous variable (that is, one stored in a numeric column in R), then we can describe a few things about it, including:

  1. the central tendency of the data: e.g. mean, median (see here for refresher)
  2. the spread or distribution of the data: e.g. the standard deviation or interquartile range (see refresher here)
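As a rough sketch of how these measures can be calculated together, the code below uses `summarise()` on six baseline weights taken from the funimagery data. The names `baseline` and `baseline_summary`, and the column names inside `summarise()`, are made up for this illustration. Six values are far too few to describe the whole trial, so treat the numbers only as a check that the code runs.

```r
library(dplyr)

# six baseline weights (kg1) taken from the funimagery data;
# `baseline` is a name invented for this illustration
baseline <- data.frame(kg1 = c(107.8, 107.0, 99.5, 82.1, 90.5, 120.4))

# central tendency (mean, median) and spread (sd, IQR) in one step
baseline_summary <- baseline %>%
  summarise(
    mean_kg1   = mean(kg1),
    median_kg1 = median(kg1),
    sd_kg1     = sd(kg1),
    iqr_kg1    = IQR(kg1)
  )

baseline_summary
```

Naming each statistic inside `summarise()` (e.g. `mean_kg1 = ...`) is optional, but it makes the result easier to read than the default `mean(kg1)` column heading.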

It’s important to remember that even simple descriptive statistics like the mean or standard deviation enable us to answer research questions — you don’t always need fancy statistics! For example, if we consider the funimagery data describing the RCT of functional imagery training, we could ask:

  • “what was the typical weight of participants at baseline?” or
  • “what was the range in which most participants’ weight fell?”

In part, you have already seen how techniques like group_by() and summarise(), or graphs like boxplots, can help calculate and present these descriptive statistics.

# (this is a recap of earlier material)
# typical weight at baseline
funimagery %>%
  summarise(mean(kg1))
  mean(kg1)
1  90.70536


# a boxplot showing the IQR as the box. The IQR includes 50% of participants
# so, we can see 50% of participants weighed between 80 and 100kg at baseline
funimagery %>%
  ggplot(aes(intervention, kg1)) +
  geom_boxplot() +
  scale_y_continuous(n.breaks = 10) # this extra line just adds more marks on the y-axis
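To see where the edges of the box come from, the quartiles can also be calculated directly. This is a small sketch using base R's `quantile()` and `IQR()` on six weights taken from the funimagery data; the names `kg`, `box_edges` and `box_height` are invented for the example. With the real data, the values should correspond closely to the bottom, middle line and top of the box drawn by `geom_boxplot()`.

```r
# six baseline weights taken from the funimagery data (illustration only)
kg <- c(82.1, 90.5, 99.5, 107.0, 107.8, 120.4)

# the 25th, 50th and 75th percentiles:
# bottom of the box, the line in the middle (the median), and top of the box
box_edges <- quantile(kg, probs = c(0.25, 0.50, 0.75))
box_edges

# the IQR is the distance between the 25th and 75th percentiles,
# i.e. the height of the box
box_height <- IQR(kg)
box_height
```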

Describing differences

The previous table and boxplot showed patients’ weights at the start of the study.

There is also a variable in this dataset called weight_lost_end_trt, which shows how much weight people lost between starting and completing FIT or MI. In a previous session we made a boxplot like this:

However, in clinical trials, it’s important to measure participants for longer periods to judge whether the effect of a treatment is sustained.

Interventions for obesity and overweight can be successful, but patients may later regain weight (Hall & Kahan, 2018). And estimating how long weight loss is sustained for is important because it changes the long term prognosis of patients, and so how cost-effective an intervention is.

The funimagery data come from a study which followed people for 6 months after completing treatment (12 months after joining the study). The kg1 column records weights at baseline, and the kg3 column records observations made at the end of follow-up.

This means we can calculate weight loss from baseline to follow-up (not just the end of treatment, which has already been done for us).

To do this we need to create a new column in our dataset. Let’s call this weight_lost_end_followup.

To calculate this new column we need to subtract weight at baseline (kg1) from weight at the end of follow-up (kg3).

In R, we can do this with the mutate function:

# use `mutate` to create a NEW COLUMN of data
# this code shows the result just below the code chunk
funimagery %>%
  mutate(weight_lost_end_followup = kg3 - kg1)
    gender age   kg1   kg2   kg3 person intervention weight_lost_end_trt weight_lost_end_followup
1        f  44 107.8 106.7 106.0      4           MI                -1.1                     -1.8
2        f  32 107.0 105.4 105.9      5           MI                -1.6                     -1.1
3        f  33  99.5 101.0  98.8      6           MI                 1.5                     -0.7
4        f  21  80.0  79.0  78.0      7           MI                -1.0                     -2.0
5        f  27  81.0  80.0  80.0      8           MI                -1.0                     -1.0
6        f  56  59.0  57.0  60.0      9           MI                -2.0                      1.0
7        f  50  95.0  92.0  92.0     10           MI                -3.0                     -3.0
8        m  57  90.0  87.0  87.0     11           MI                -3.0                     -3.0
9        f  34  87.0  87.0  86.4     12           MI                 0.0                     -0.6
10       f  25 121.2 123.0 119.7     13           MI                 1.8                     -1.5
11       m  70  84.0  81.9  84.0     14           MI                -2.1                      0.0
12       f  56 100.0  97.2  98.0     15           MI                -2.8                     -2.0
13       f  55  87.4  85.0  84.9     16           MI                -2.4                     -2.5
14       f  43 100.4  99.1 100.0     17           MI                -1.3                     -0.4
15       m  37  89.5  91.0  91.5     18           MI                 1.5                      2.0
16       m  45  84.2  85.6  86.2     19           MI                 1.4                      2.0
17       f  60  92.3  88.0  88.0     20           MI                -4.3                     -4.3
18       f  51 112.0 109.0 109.0     21           MI                -3.0                     -3.0
19       f  21  76.0  75.0  76.2     22           MI                -1.0                      0.2
20       m  43  89.5  87.2  88.0     23           MI                -2.3                     -1.5
21       f  21 118.0 116.7 115.0     24           MI                -1.3                     -3.0
22       m  19 102.4  99.4 100.0     25           MI                -3.0                     -2.4
23       m  65  79.4  80.2  79.5     26           MI                 0.8                      0.1
24       m  40  88.0  85.9  87.8     27           MI                -2.1                     -0.2
25       f  45  85.0  83.0  80.0     28           MI                -2.0                     -5.0
26       f  66  67.5  65.4  65.6     29           MI                -2.1                     -1.9
27       f  38  91.3  91.5  89.0     30           MI                 0.2                     -2.3
28       f  55  88.3  85.0  87.1     31           MI                -3.3                     -1.2
29       f  52  86.2  85.0  85.0     32           MI                -1.2                     -1.2
30       f  41  77.4  76.7  80.1     33           MI                -0.7                      2.7
31       m  40  79.0  76.5  80.0     34           MI                -2.5                      1.0
32       m  44  99.0  98.2  95.0     35           MI                -0.8                     -4.0
33       f  51  87.0  87.0  85.0     36           MI                 0.0                     -2.0
34       f  56  85.0  82.3  85.2     37           MI                -2.7                      0.2
35       m  40  69.0  66.1  67.0     38           MI                -2.9                     -2.0
36       f  30 119.0 127.0 123.0     39           MI                 8.0                      4.0
37       m  51 103.5 104.7 103.0     40           MI                 1.2                     -0.5
38       m  35  73.7  71.6  71.2     41           MI                -2.1                     -2.5
39       f  24  87.0  84.7  86.0     42           MI                -2.3                     -1.0
40       f  36 131.3 132.0 132.0     43           MI                 0.7                      0.7
41       f  47 120.0 120.0 122.0     44           MI                 0.0                      2.0
42       f  41  87.5  86.2  85.0     45           MI                -1.3                     -2.5
43       f  54  90.0  87.7  86.0     46           MI                -2.3                     -4.0
44       f  20  90.0  88.0  88.0     47           MI                -2.0                     -2.0
45       f  60  87.0  85.2  84.0     48           MI                -1.8                     -3.0
46       f  33  83.0  81.0  80.0     49           MI                -2.0                     -3.0
47       f  23  72.0  71.2  71.0     50           MI                -0.8                     -1.0
48       f  50  75.0  72.0  70.0     51           MI                -3.0                     -5.0
49       f  45  69.0  69.0  68.0     52           MI                 0.0                     -1.0
50       f  34  85.4  78.8  74.7     53           MI                -6.6                    -10.7
51       f  35  76.5  76.3  78.0     54           MI                -0.2                      1.5
52       m  56  74.1  70.8  72.0     55           MI                -3.3                     -2.1
53       f  33  93.4  94.5  93.5     58           MI                 1.1                      0.1
54       f  56  82.1  76.0  74.0     59          FIT                -6.1                     -8.1
55       f  20  90.5  88.0  84.5     60          FIT                -2.5                     -6.0
56       f  60 120.4 109.2  98.0     61          FIT               -11.2                    -22.4
57       f  59  97.2  94.6  90.0     62          FIT                -2.6                     -7.2
58       f  45  78.0  74.8  68.0     63          FIT                -3.2                    -10.0
59       f  39  79.9  75.8  68.5     64          FIT                -4.1                    -11.4
60       m  40 101.2  98.7  95.3     65          FIT                -2.5                     -5.9
61       f  25 111.0 100.0  90.0     66          FIT               -11.0                    -21.0
62       m  28  79.6  83.8  86.4     67          FIT                 4.2                      6.8
63       m  22  62.3  63.4  60.0     68          FIT                 1.1                     -2.3
64       f  26  76.8  70.0  65.0     69          FIT                -6.8                    -11.8
65       m  42  95.9  97.9  85.0     70          FIT                 2.0                    -10.9
66       f  35  84.7  82.0  80.0     71          FIT                -2.7                     -4.7
67       m  70 116.4 111.8 113.5     72          FIT                -4.6                     -2.9
68       f  46 100.2  94.0  90.0     73          FIT                -6.2                    -10.2
69       m  22 140.5 138.7 146.8     74          FIT                -1.8                      6.3
70       f  46  91.7  79.3  75.2     75          FIT               -12.4                    -16.5
71       f  60  85.4  80.2  70.0     76          FIT                -5.2                    -15.4
72       f  43  86.7  85.7  80.0     77          FIT                -1.0                     -6.7
73       f  58 112.6  95.0  98.5     78          FIT               -17.6                    -14.1
74       m  40 113.9 104.0  95.0     79          FIT                -9.9                    -18.9
75       f  60  75.5  70.0  70.0     80          FIT                -5.5                     -5.5
76       f  23  99.5  96.5  90.0     81          FIT                -3.0                     -9.5
77       f  55 101.5  96.1  97.0     82          FIT                -5.4                     -4.5
78       m  27  95.9  87.2  84.1     83          FIT                -8.7                    -11.8
79       f  33  94.9  90.9  92.0     84          FIT                -4.0                     -2.9
80       f  40  83.5  74.0  73.9     85          FIT                -9.5                     -9.6
81       f  50 107.0 104.4  95.0     86          FIT                -2.6                    -12.0
82       f  53  77.8  73.2  78.1     87          FIT                -4.6                      0.3
83       f  69  89.8  86.9  80.0     88          FIT                -2.9                     -9.8
84       m  48  88.9  85.2  85.0     89          FIT                -3.7                     -3.9
85       f  58 103.9  98.7  97.0     90          FIT                -5.2                     -6.9
86       m  35  64.0  59.0  58.0     91          FIT                -5.0                     -6.0
87       f  53  74.0  70.0  65.0     92          FIT                -4.0                     -9.0
88       m  36  98.3  89.0  94.0     93          FIT                -9.3                     -4.3
89       f  24  88.4  82.0  75.0     94          FIT                -6.4                    -13.4
90       f  46  88.6  84.6  85.0     95          FIT                -4.0                     -3.6
91       f  20  94.9  90.1  90.0     96          FIT                -4.8                     -4.9
92       f  62  76.3  72.1  72.0     97          FIT                -4.2                     -4.3
93       f  51 110.0 100.5  99.0     98          FIT                -9.5                    -11.0
94       f  42  82.0  78.2  70.0     99          FIT                -3.8                    -12.0
95       f  44 103.0  87.4  79.0    100          FIT               -15.6                    -24.0
96       f  23  95.9  93.6  93.2    101          FIT                -2.3                     -2.7
97       f  20  95.9  87.2  94.2    102          FIT                -8.7                     -1.7
98       f  56  90.4  79.0  85.0    103          FIT               -11.4                     -5.4
99       f  42  68.0  68.0  73.8    104          FIT                 0.0                      5.8
100      f  35 125.9 122.3 124.5    105          FIT                -3.6                     -1.4
101      f  64  82.4  77.8  81.8    106          FIT                -4.6                     -0.6
102      f  29  70.8  63.9  60.0    107          FIT                -6.9                    -10.8
103      f  72  73.3  73.5  73.4    108          FIT                 0.2                      0.1
104      m  48  78.7  79.5  70.0    109          FIT                 0.8                     -8.7
105      m  46  84.0  80.0  75.3    110          FIT                -4.0                     -8.7
106      f  66  72.6  68.0  64.0    111          FIT                -4.6                     -8.6
107      f  69  89.4  83.6  86.0    112          FIT                -5.8                     -3.4
108      f  64 121.4 114.0 114.0    113          FIT                -7.4                     -7.4
109      m  39 101.8  97.1  94.0    114          FIT                -4.7                     -7.8
110      m  50  80.7  75.2  74.1    115          FIT                -5.5                     -6.6
111      f  54  80.5  84.3  82.0    116          FIT                 3.8                      1.5
 [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
  • run the code above and show students the result
  • point out that this has not been stored anywhere — just displayed in the RStudio GUI, below the code chunk

What mutate() does is make a copy of our dataset with a new column added. That is, it always gives us back a new dataset.
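One way to convince yourself of this is to check the original dataset after running `mutate()`. The sketch below uses three rows copied from the output above, held in a small data frame called `weights` (a name made up for this example).

```r
library(dplyr)

# three rows copied from the funimagery output above;
# `weights` is a name invented for this illustration
weights <- data.frame(
  kg1 = c(107.8, 107.0, 99.5),
  kg3 = c(106.0, 105.9, 98.8)
)

# this displays a copy of the data with the extra column...
weights %>%
  mutate(weight_lost_end_followup = kg3 - kg1)

# ...but the original is unchanged: it still has only kg1 and kg3
names(weights)
```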

We almost always want to STORE this new copy of the dataset so we can use the new column that was created. To do this we assign the result of mutate() to a new variable (the ‘container’ type of variable).

The assignment operator is the left hand arrow, <-:

funimagery.edited <- funimagery %>%
  mutate(weight_lost_end_followup = kg3 - kg1)
  • show how a new variable has been created in the Environment window

So the code above:

  • takes the funimagery data and pipes it to the mutate() function, which
  • adds a new column, called weight_lost_end_followup
  • this new column is made by subtracting kg1 (baseline) from kg3 (end of followup); it then
  • stores this new copy of the dataset (with the extra column) in a new variable called funimagery.edited.

We can then use this new variable, funimagery.edited, to do more work, like making a boxplot:

# boxplot of weight lost at end of follow-up using new column
funimagery.edited %>%
  ggplot(aes(intervention, weight_lost_end_followup)) +
  geom_boxplot()

If anything, it looks like the difference between groups is even BIGGER after follow-up than it was at the end of treatment, which is very promising for FIT.
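The boxplot gives a visual impression of the difference. To put a number on it, one option is to combine `mutate()` with `group_by()` and `summarise()`. The sketch below does this for six rows copied from the output above (three from each group); the names `sample_rows`, `group_means` and `mean_change` are invented for the example. Because it uses only six rows, the result is not the trial's actual group difference, so run the same steps on the full dataset to get that.

```r
library(dplyr)

# six rows copied from the funimagery output above (three MI, three FIT);
# `sample_rows` is a name invented for this illustration
sample_rows <- data.frame(
  intervention = c("MI", "MI", "MI", "FIT", "FIT", "FIT"),
  kg1 = c(107.8, 107.0, 99.5, 82.1, 90.5, 120.4),
  kg3 = c(106.0, 105.9, 98.8, 74.0, 84.5, 98.0)
)

# create the change score, then average it within each group
# (negative values mean weight was lost)
group_means <- sample_rows %>%
  mutate(weight_lost_end_followup = kg3 - kg1) %>%
  group_by(intervention) %>%
  summarise(mean_change = mean(weight_lost_end_followup))

group_means
```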

Exercise 1

  1. Open session-4.rmd using the Files pane. This is the workbook you will be using in this session.
  2. Use group_by() and summarise() with the built-in iris dataset to calculate the mean Sepal.Width for each Species of flower.
  3. Make a boxplot that shows the sepal width for each species of flower.

These are the correct numbers to check your work against:

Species Mean sepal width
setosa 3.4
versicolor 2.8
virginica 3.0

Your plot should look like this:

The aes part of your ggplot code should be: